Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substitute zdnn calls for stick/unstick late, after most ZLow optimizations are performed #2812

Merged

Conversation

AlexandreEichenberger
Copy link
Collaborator

@AlexandreEichenberger AlexandreEichenberger commented May 2, 2024

The many prior instances used buffered / tiled stick unstick code gen, but the redundant load / store ended up costing more than the simple approach of stick/unstick and store.

Also, the performance of dynamic shapes were quite lagging, because many key optimizations to remove unnecessary stick / unstick rely on ZLow stick/unstick patterns. Thus converting stick/unstick early (in the ZHigh to ZLow were preventing these opts.

So now I have a separate pass that changes the ZLow stick/unstick after zlow rewrite operations, and it now results in performance speedup.

Detailed results are available, the short story is that using CSU (compiler gen stick/unstick) Stick instead of zDNN Stick is 1.56x faster for Roberta (dyn shapes), and 6.71 x faster when using 6 threads.

For unstick, the speedup is 1.52 and 5.73x faster

Results for entire Roberta base with Batch size of 6 and 6 threads is a speedup of 1.04x in sequential mode, and 1.27x in parallel mode.

image

Note that this optimization still relies on having a LLVM with the affine.prefetch that accepts zAIU formats (aka div/mod affine expressions). The PR in LLVM was accepted and merged, it has to filter its way back into our LLVM.

Code was verified by small corner examples and verifying the output of Roberta with/without it.

Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger
Copy link
Collaborator Author

@tungld the lowering is done from zlow to krnl late, but it seems to work well. Let me know if you have reservations.

Signed-off-by: Alexandre Eichenberger <[email protected]>
Copy link
Collaborator

@tungld tungld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Glad to hear we get performance improvement in the dynamic cases. I also prefer the lower-level generation of stick/unstick when most of optimizations have been applied.

@@ -467,6 +468,31 @@ class AddSubWithRHSZeroExpandPattern : public OpRewritePattern<OP_TYPE> {
}
};

class RemoveReshapeWithIdentityPattern
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clever to do it here. Thanks!

@@ -643,6 +671,15 @@ void getRewriteONNXForZHighDynamicallyLegal(
return isSuitableForZDNN<ONNXConvOp>(op) ||
!canInferencePadsForNNPAConv(op);
});
#if 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you will remove this in the final version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tx

using MDBuilder = MultiDialectBuilder<IndexExprBuilderForKrnl, KrnlBuilder,
MathBuilder, MemRefBuilder, VectorBuilder, AffineBuilder, SCFBuilder>;

/// Remove unstick if there is no use of its second operand except itself.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment does not seem to reflect the following pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

}
};

/// Remove stick if there is no use of its second operand except itself.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment does not seem to reflect the following pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tx

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger
Copy link
Collaborator Author

@jenkins-droid test this please

1 similar comment
@AlexandreEichenberger
Copy link
Collaborator Author

@jenkins-droid test this please

@AlexandreEichenberger AlexandreEichenberger merged commit 893cf89 into onnx:main May 9, 2024
6 of 7 checks passed
@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #13825 [push] Substitute zdnn calls fo... started at 12:51

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #14795 [push] Substitute zdnn calls fo... started at 11:39

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #14800 [push] Substitute zdnn calls fo... started at 12:39

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #14795 [push] Substitute zdnn calls fo... passed after 1 hr 5 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #14800 [push] Substitute zdnn calls fo... passed after 1 hr 28 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #13825 [push] Substitute zdnn calls fo... passed after 1 hr 56 min

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants